Notation
Each observation’s outcome \(Y\) (continuous or categorical) is to be predicted with function of auxiliary knowledge, i.e. covariates, from that observation \(X=x\) , denoted as \(\hat Y(x)\)
Want \(\hat Y(x)\) to be close to \(Y\) , but need to define what is meant by “close”
Use ‘mean squared prediction error’ (MSPE) for now: \((Y - \hat Y(x))^2\)
Numeric example (binary outcome)
Generate \(Y\) from logistic regression model: \(\Pr(Y = 1|X=x) = 1/(1+\exp(-\alpha+ x\beta))\)
One observation consists of \(\{Y, x\}\)
\(n = 100\) observations each in training and validation
\(p = 20\) covariates, distributed as independent normal random variables
\(p/2 = 10\) of coefficients equal to 0.25; remaining equal to 0
Expected prevalence of about 0.6
\(\hat Y(x)=\hat{\Pr}(Y=1|X=x) = 1/(1+\exp(-\hat\alpha+ x\hat\beta))\)
set.seed (7201969 );
#n = 400 observations; 1:1 for training:validation
n <- 400 ;
training_subset <- 1 : round (n / 2 );
validation_subset <- (round (n / 2 ) + 1 ): n;
#p covariates
p <- 20 ;
#baseline prevalence approximately 0.6
alpha <- log (0.6 / 0.4 ); #0.405
#normally distributed covariates, correlation 0.1
x <- matrix (rnorm (n* p),
nrow = n);
x <- x %*% chol (0.1 + diag (0.9 , p))
colnames (x) <- glue ("x{1:p}" )
#half of covariates have odds ratios exp(0.25)
beta <- c (rep (0.25 , floor (p/ 2 )), numeric (ceiling (p/ 2 )));
true_probs <- 1 / (1 + exp (- alpha - drop (x%*% beta)));
y <- rbinom (n, 1 , true_probs);
all_data <- bind_cols (y = y, data.frame (x));
full_fmla <-
glue ("y~{glue_collapse(glue('x{1:p}'),sep='+')}" ) %>%
as.formula ();
Eight strategies considered:
True model (‘(truth)’; for benchmarking)
Intercept only (‘null’; not considered for selection)
All covariates (‘full’)
Forward selection (‘forward’): start with the null model, incrementally add in covariates that seem to improve fit
Forward selection with max of 1 variable / 25 observations in training data (‘forward_pruned’)
Backward selection (‘backward’): start will the full model, incrementally subtract covariates that seem to harm fit
Logistic regression with firth penalty (‘firth’): penalized logistic regression
Logistic regression with lasso penalty
full_model <- glm (full_fmla,
data = all_data,
subset = training_subset,
family = "binomial" );
null_model <- glm (y ~ 1 ,
data = all_data,
subset = training_subset,
family = "binomial" );
#forward selection (using MASS package)
forward_model <-
stepAIC (null_model,
scope = list (upper = full_fmla),
direction = "forward" ,
trace = F);
#pruned forward selection: max of 1 coefficient per 25 training observations
forward_pruned_model <-
stepAIC (null_model,
scope = list (upper = full_fmla),
direction = "forward" ,
steps = max (1 , floor (length (training_subset) / 25 )),
trace = F);
#backward selection
backward_model <-
stepAIC (full_model,
direction = "backward" ,
trace = F);
# Firth logistic regression
firth_model <-
logistf (full_fmla,
data = slice (all_data, training_subset))
# Lasso regression
lasso_model <-
cv.glmnet (full_fmla,
data = all_data,
subset = training_subset,
family = "binomial" , alpha = 1 )
Selection
tidy (full_model) %>%
left_join (tibble (term = glue ("x{1:floor(p/2)}" ),
nonzero = 1 )) %>%
mutate (nonzero = replace_na (nonzero, 0 )) %>%
filter (term != "(Intercept)" ) %>%
arrange (p.value) %>%
print (n = Inf );
# A tibble: 20 × 6
term estimate std.error statistic p.value nonzero
<glue> <dbl> <dbl> <dbl> <dbl> <dbl>
1 x6 0.427 0.178 2.40 0.0163 1
2 x3 0.391 0.176 2.22 0.0262 1
3 x4 0.342 0.189 1.81 0.0698 1
4 x5 0.324 0.178 1.81 0.0698 1
5 x7 0.288 0.167 1.72 0.0847 1
6 x11 -0.327 0.207 -1.58 0.114 0
7 x9 0.281 0.192 1.46 0.143 1
8 x1 0.192 0.154 1.25 0.212 1
9 x14 -0.170 0.166 -1.02 0.307 0
10 x13 -0.160 0.167 -0.962 0.336 0
11 x8 0.168 0.179 0.941 0.346 1
12 x18 -0.154 0.176 -0.874 0.382 0
13 x20 -0.133 0.180 -0.739 0.460 0
14 x19 0.0909 0.165 0.550 0.582 0
15 x10 0.0951 0.174 0.547 0.584 1
16 x17 0.106 0.196 0.542 0.588 0
17 x16 0.0938 0.173 0.542 0.588 0
18 x2 0.0286 0.170 0.168 0.866 1
19 x12 -0.0104 0.200 -0.0521 0.958 0
20 x15 -0.00586 0.175 -0.0334 0.973 0
tidy (forward_model) %>%
left_join (tibble (term = glue ("x{1:floor(p/2)}" ),
nonzero = 1 )) %>%
mutate (nonzero = replace_na (nonzero, 0 )) %>%
filter (term != "(Intercept)" ) %>%
arrange (p.value) %>%
print (n = Inf );
# A tibble: 6 × 6
term estimate std.error statistic p.value nonzero
<glue> <dbl> <dbl> <dbl> <dbl> <dbl>
1 x6 0.432 0.162 2.66 0.00773 1
2 x3 0.384 0.168 2.29 0.0218 1
3 x4 0.359 0.170 2.11 0.0352 1
4 x5 0.344 0.168 2.04 0.0409 1
5 x11 -0.324 0.192 -1.69 0.0908 0
6 x7 0.227 0.152 1.49 0.136 1
tidy (forward_pruned_model) %>%
left_join (tibble (term = glue ("x{1:floor(p/2)}" ),
nonzero = 1 )) %>%
mutate (nonzero = replace_na (nonzero, 0 )) %>%
filter (term != "(Intercept)" ) %>%
arrange (p.value) %>%
print (n = Inf );
# A tibble: 6 × 6
term estimate std.error statistic p.value nonzero
<glue> <dbl> <dbl> <dbl> <dbl> <dbl>
1 x6 0.432 0.162 2.66 0.00773 1
2 x3 0.384 0.168 2.29 0.0218 1
3 x4 0.359 0.170 2.11 0.0352 1
4 x5 0.344 0.168 2.04 0.0409 1
5 x11 -0.324 0.192 -1.69 0.0908 0
6 x7 0.227 0.152 1.49 0.136 1
tidy (backward_model) %>%
left_join (tibble (term = glue ("x{1:floor(p/2)}" ),
nonzero = 1 )) %>%
mutate (nonzero = replace_na (nonzero, 0 )) %>%
filter (term != "(Intercept)" ) %>%
arrange (p.value) %>%
print (n = Inf );
# A tibble: 6 × 6
term estimate std.error statistic p.value nonzero
<glue> <dbl> <dbl> <dbl> <dbl> <dbl>
1 x6 0.432 0.162 2.66 0.00773 1
2 x3 0.384 0.168 2.29 0.0218 1
3 x4 0.359 0.170 2.11 0.0352 1
4 x5 0.344 0.168 2.04 0.0409 1
5 x11 -0.324 0.192 -1.69 0.0908 0
6 x7 0.227 0.152 1.49 0.136 1
tibble (term = firth_model$ terms, estimate = firth_model$ coefficients, p.value = firth_model$ prob) %>%
left_join (tibble (term = glue ("x{1:floor(p/2)}" ),
nonzero = 1 )) %>%
mutate (nonzero = replace_na (nonzero, 0 )) %>%
filter (term != "(Intercept)" ) %>%
arrange (p.value) %>%
print (n = Inf );
# A tibble: 20 × 4
term estimate p.value nonzero
<glue> <dbl> <dbl> <dbl>
1 x6 0.377 0.0211 1
2 x3 0.344 0.0337 1
3 x5 0.287 0.0828 1
4 x4 0.302 0.0840 1
5 x7 0.250 0.105 1
6 x11 -0.287 0.133 0
7 x9 0.246 0.168 1
8 x1 0.171 0.234 1
9 x14 -0.150 0.336 0
10 x13 -0.144 0.357 0
11 x8 0.147 0.380 1
12 x18 -0.138 0.405 0
13 x20 -0.119 0.482 0
14 x17 0.0970 0.598 0
15 x19 0.0819 0.599 0
16 x10 0.0825 0.613 1
17 x16 0.0820 0.614 0
18 x2 0.0271 0.866 1
19 x15 -0.00676 0.967 0
20 x12 -0.00648 0.973 0
tidy (lasso_model$ glmnet.fit, return_zeros = TRUE ) %>% filter (lambda == lasso_model$ lambda.1 se) %>%
left_join (tibble (term = glue ("x{1:floor(p/2)}" ),
nonzero = 1 )) %>%
mutate (nonzero = replace_na (nonzero, 0 )) %>%
filter (term != "(Intercept)" ) %>%
arrange (- abs (estimate)) %>%
print (n = Inf );
# A tibble: 20 × 6
term step estimate lambda dev.ratio nonzero
<glue> <dbl> <dbl> <dbl> <dbl> <dbl>
1 x3 5 0.120 0.0780 0.0450 1
2 x4 5 0.104 0.0780 0.0450 1
3 x6 5 0.0582 0.0780 0.0450 1
4 x5 5 0.0442 0.0780 0.0450 1
5 x1 5 0 0.0780 0.0450 1
6 x2 5 0 0.0780 0.0450 1
7 x7 5 0 0.0780 0.0450 1
8 x8 5 0 0.0780 0.0450 1
9 x9 5 0 0.0780 0.0450 1
10 x10 5 0 0.0780 0.0450 1
11 x11 5 0 0.0780 0.0450 0
12 x12 5 0 0.0780 0.0450 0
13 x13 5 0 0.0780 0.0450 0
14 x14 5 0 0.0780 0.0450 0
15 x15 5 0 0.0780 0.0450 0
16 x16 5 0 0.0780 0.0450 0
17 x17 5 0 0.0780 0.0450 0
18 x18 5 0 0.0780 0.0450 0
19 x19 5 0 0.0780 0.0450 0
20 x20 5 0 0.0780 0.0450 0
predict_models <-
list (null = null_model,
full = full_model,
forward = forward_model,
forward_pruned = forward_pruned_model,
backward = backward_model,
firth = firth_model,
lasso = lasso_model) %>%
map (predict, newdata = all_data, type = 'resp' ) %>%
map (as.numeric) %>%
bind_cols () %>%
mutate (true_probs = true_probs,
y = y,
training = row_number () %in% training_subset) %>%
pivot_longer (c (true_probs, null: lasso),
names_to = "model_name" ) %>%
mutate (model_name = factor (model_name) %>% fct_inorder ())
Training MSPEs
\(\dfrac{1}{200}\sum_{i=1}^{200} (Y_i - \hat Y(x_i))^2\)
predict_models %>%
filter (training) %>%
group_by (model_name) %>%
summarize (mspe = mean ((y - value)^ 2 ))
# A tibble: 8 × 2
model_name mspe
<fct> <dbl>
1 true_probs 0.215
2 null 0.245
3 full 0.200
4 forward 0.209
5 forward_pruned 0.209
6 backward 0.209
7 firth 0.200
8 lasso 0.230
Full model has smallest MSPE in training subset
Validation MSPEs
\(\dfrac{1}{200}\sum_{i=201}^{400} (Y_i - \hat Y(x_i))^2\)
predict_models %>%
filter (! training) %>%
group_by (model_name) %>%
summarize (mspe = mean ((y - value)^ 2 ))
# A tibble: 8 × 2
model_name mspe
<fct> <dbl>
1 true_probs 0.205
2 null 0.244
3 full 0.226
4 forward 0.229
5 forward_pruned 0.229
6 backward 0.229
7 firth 0.225
8 lasso 0.234
Except for true and null models, all MSPEs increase
Firth model has smallest MSPE in validation subset
An aside: even though it has the smallest MSPE, is it still good in an absolute sense? What MSPE do we get from using \(\hat Y(x_i)\equiv 0.5\) ?
Assuming we report firth model as the model, is this the MSPE we should expect in the future?
Simulation study
\(n = 200;200;200\) observations in training;validation;testing
Same generating model as previously but repeat 500 times
In each simulated dataset, only three models are taken to the testing step: the true and null models (for benchmarking) and whichever other model has best validation MSPE
observed_results <-
all_results %>%
group_by (sim) %>%
#keep from the test step only the method we would have selected
filter (model_name %in% c ("(truth)" ,"null" ) |
step != "testing" |
mspe_ranking == min (mspe_ranking)) %>%
ungroup ();
model_colors = c ("black" ,brewer.pal (7 , "Dark2" ));
ggplot (observed_results) +
geom_boxplot (aes (x = step,
y = mspe,
color = model_name),
fill = "#AAAAAAAA" ,
outlier.shape = NA ,
varwidth = FALSE ) +
geom_hline (yintercept = 0.25 ) +
scale_color_manual (values = model_colors) +
labs (x = "" , y = "MSPE" , color = "Strategy" ) +
theme (text = element_text (size = 22 ),
legend.position = "top" );
change_in_optimism <-
observed_results %>%
dplyr:: select (- row_number, - absolute, - zero_one, - deviance) %>%
pivot_wider (names_from = step, values_from = mspe) %>%
mutate (validation_minus_training = validation - training,
testing_minus_validation = testing - validation) %>%
dplyr:: select (- training, - validation, - testing) %>%
pivot_longer (cols = contains ("minus" ), names_to = "delta" , values_to = "value" ) %>%
group_by (sim) %>%
filter (model_name %in% c ("(truth)" ,"null" ) |
delta == "validation_minus_training" |
mspe_ranking == min (mspe_ranking)) %>%
ungroup () %>%
mutate (delta = factor (delta,
levels = c ("validation_minus_training" ,
"testing_minus_validation" ),
labels = c ("training to validation" ,
"validation to testing" )));
ggplot (change_in_optimism) +
geom_boxplot (aes (x = delta,
y = value,
color = model_name),
fill = "#AAAAAAAA" ,
outlier.shape = NA ,
varwidth = FALSE ) +
scale_color_manual (values = model_colors) +
labs (x = "" ,
y = "Optimism (change in MSPE)" ,
color = "Strategy" ) +
theme (text = element_text (size = 22 ),
legend.position = "top" );
Model selection , conducted properly, adjusts for optimism in training
Model assessment , conducted properly, adjusts for regression to the mean; potential for optimism increases with variability of method
Overall prediction error
“How close are the actual outcomes to the predicted outcomes?”
\(\hat Y(x)=\hat{\Pr}(Y=1|X=x)\)
MSPE or Brier score : \(=(Y - \hat Y(x))^2\)
Absolute : \(=|Y - \hat Y(x)|\)
0-1 : \((1-Y)\times 1_{[ \hat Y(x)\geq0.5]} + Y\times 1_{[ \hat Y(x)<0.5]}\)
Deviance : \(-2(1-Y)\log[1- \hat Y(x)] -2Y\log \hat Y(x)\)
Error functions against \(\hat Y(x)\) when \(Y=1\)
prob_seq <- seq (0.01 , 1 , by = 0.001 );#
ggplot () +
geom_path (aes (x = prob_seq, y = (1 - prob_seq)^ 2 , color = "1Quadratic" ), size = 1 ) +
geom_path (aes (x = prob_seq, y = abs (1 - prob_seq), color = "2Absolute" ), size = 1 ) +
geom_path (aes (x = prob_seq, y = 1 * (prob_seq < 0.5 ), color = "30-1" ), size = 1 ) +
geom_path (aes (x = prob_seq, y = - 2 * log (prob_seq), color = "4Deviance" ), size = 1 ) +
coord_cartesian (ylim = c (0 , 2 )) +
scale_color_manual (labels = c ("Quadratic (MSPE)" , "Absolute" , "0-1" , "Deviance" ),
values = c ("#E41A1C" ,"#377EB8" ,"#4DAF4A" ,"#984EA3" )) +
labs (x = expression (hat (Y (x))),
y = "" ,
color = "Loss" ) +
theme (text = element_text (size = 22 ),
legend.position = "top" );
predict_models %>%
filter (! training) %>%
group_by (model_name) %>%
summarize (mspe = mean ((y - value)^ 2 ),
abs = mean (abs (y - value)),
zero_one = mean ((1 - y) * (value >= 0.5 ) + y * (value < 0.5 )),
deviance = - 2 * mean ((1 - y)* log (1 - value) + y* log (value)))
# A tibble: 8 × 5
model_name mspe abs zero_one deviance
<fct> <dbl> <dbl> <dbl> <dbl>
1 true_probs 0.205 0.398 0.335 1.19
2 null 0.244 0.489 0.42 1.36
3 full 0.226 0.427 0.36 1.30
4 forward 0.229 0.438 0.355 1.30
5 forward_pruned 0.229 0.438 0.355 1.30
6 backward 0.229 0.438 0.355 1.30
7 firth 0.225 0.433 0.36 1.28
8 lasso 0.234 0.476 0.41 1.32
Calibration
“Among observations with predicted prevalence of X%, is the true prevalence close to X%?”
overall error functions capture elements of calibration
Hosmer-Lemeshow test: group observations based upon \(\hat Y(x)\) , compare \(\sum_i \hat Y(x_i)\) to \(\sum_i Y_i\)
Discrimination
“Did the observations in which the outcome occured have a higher predicted risk than the observations in which the outcome did not occur?”
Sensitivity : probability of predicting \(\hat Y(x)=1\) given that, in truth, \(Y(x)=1\)
it is not the probability that \(Y(x)=1\) given that we’ve predicted \(\hat Y(x)=1\)
Specificity : probability of predicting \(\hat Y(x)=0\) given that, in truth, \(Y(x)=0\)
Receiver operator characteristic (ROC) curve : plot of \(\hat{\mathrm{sens}}(t)\) versus \(1-\hat{\mathrm{spec}}(t)\) for \(t\in[0,1]\) , where
concordance (\(c\) ) index : \(\dfrac{\sum_{i,j:Y_i=0,Y_j=1} 1_{[\hat Y(x_i) < \hat Y(x_j)]} + 0.5\times 1_{[\hat Y(x_i) = \hat Y(x_j)]}}{ \sum_{i,j:Y_i=0,Y_j=1} 1}\)
# use pROC package
true_training_roc <-
predict_models %>%
filter (training, model_name == "true_probs" ) %>%
roc (response = y, predictor = value)
true_validation_roc <-
predict_models %>%
filter (! training, model_name == "true_probs" ) %>%
roc (response = y, predictor = value)
firth_training_roc <-
predict_models %>%
filter (training, model_name == "firth" ) %>%
roc (response = y, predictor = value)
firth_validation_roc <-
predict_models %>%
filter (! training, model_name == "firth" ) %>%
roc (response = y, predictor = value)
roc_data <-
bind_rows (tibble (model = "(truth)" ,
step = "training" ,
sens = true_training_roc$ sensitivities,
spec = true_training_roc$ specificities,
thresh = true_training_roc$ thresholds),
tibble (model = "(truth)" ,
step = "validation" ,
sens = true_validation_roc$ sensitivities,
spec = true_validation_roc$ specificities,
thresh = true_validation_roc$ thresholds),
tibble (model = "firth" ,
step = "training" ,
sens = firth_training_roc$ sensitivities,
spec = firth_training_roc$ specificities,
thresh = firth_training_roc$ thresholds),
tibble (model = "firth" ,
step = "validation" ,
sens = firth_validation_roc$ sensitivities,
spec = firth_validation_roc$ specificities,
thresh = firth_validation_roc$ thresholds)) %>%
mutate (model = factor (model, levels = c ("(truth)" , "firth" ))) %>%
group_by (model, step) %>%
mutate (annotate = ifelse (abs (0.5 - thresh) == min (abs (0.5 - thresh)) & step == "validation" , TRUE , FALSE )) %>%
ungroup () %>%
arrange (model, step, desc (spec), sens);
roc_plot <-
ggplot (filter (roc_data),
aes (x = 1 - spec,
y = sens,
color = model,
linetype = step)) +
geom_path () +
geom_abline (intercept = 0 , slope = 1 ) +
geom_label_repel (data = filter (roc_data, annotate),
aes (label = paste0 ("t = " , formatC (thresh, format = "f" , digits = 1 ))),
nudge_x = - 0.051 ,
nudge_y = 0.051 ,
size = 4.5 ) +
scale_x_continuous (name = "Spec" ,
breaks = seq (0 , 1 , length = 11 ),
labels = formatC (seq (1 , 0 , length = 11 ), format = "g" , digits = 1 ),
expand = expand_scale (add = 0.02 )) +
scale_y_continuous (name = "Sens" ,
breaks = seq (0 , 1 , length = 11 ),
labels = formatC (seq (0 , 1 , length = 11 ), format = "g" , digits = 1 ),
expand = expand_scale (add = 0.02 )) +
scale_linetype_manual (name = "Step" ,
values = c ("dashed" , "solid" )) +
scale_color_manual (name = "Model" ,
values = model_colors[c (1 ,7 )]) +
theme (text = element_text (size = 16 ),
legend.position = "top" );
roc_plot;
(base ROC plot)
plot (firth_validation_roc)
Interpreting ROC curve
For the firth model:
For fixed specificity of 0.8, can plausibly expect sensitivity of about 0.45
If we classify based upon cutoff of \(\hat Y(x) > 0.5\) versus \(\hat Y(x) \leq 0.5\) ,we would expect sensitivity ~0.73 and specificity ~0.52
Achieving high sensitivity is difficult without sacrificing specificity
What does least ideal ROC curve look like?
Area under ROC curve (AUC or AUROC)
ROC curves are interesting if range of cutoffs are of interest
However, it has been shown that the area under the ROC curve is, in the case of logistic regression, the \(c\) -index:
\[\dfrac{\sum_{i,j:Y_i=0,Y_j=1} 1_{[\hat Y(x_i) < \hat Y(x_j)]}}{ \sum_{i,j:Y_i=0,Y_j=1} 1}\]
This estimates the probability of correctly ordering the risk of a randomly selected outcome and randomly selected non-outcome.
AUC
ggplot (filter (roc_data, model == "firth" , step == "validation" )) +
geom_path (aes (x = 1 - spec,#
y = sens)) +
geom_ribbon (aes (x = 1 - spec,
ymin = 0 ,
ymax = sens),
fill = "#777777" ) +
geom_abline (intercept = 0 , slope = 1 ) +
geom_text (data = tibble (x = 0.8 , y = 0.2 , label = "AUC" ),
aes (x = x,
y = y,
label = label),
size = 10 ) +
scale_x_continuous (name = "Spec" ,
breaks = seq (0 , 1 , length = 11 ),
labels = formatC (seq (1 , 0 , length = 11 ), format = "g" , digits = 1 ),
expand = expand_scale (add = 0 )) +
scale_y_continuous (name = "Sens" ,
breaks = seq (0 , 1 , length = 11 ),
labels = formatC (seq (0 , 1 , length = 11 ), format = "g" , digits = 1 ),
expand = expand_scale (add = 0 )) +
theme (text = element_text (size = 16 ),
legend.position = "none" );
AOC = 1 - AUC
library (grid);library (jpeg);#
g <- rasterGrob (readJPEG ("images/aoc.jpg" ), width= unit (1 ,"npc" ), height= unit (1 ,"npc" ), interpolate = FALSE )
ggplot (filter (roc_data, model == "firth" , step == "validation" )) +
annotation_custom (g, - Inf , Inf , - Inf , Inf ) +
geom_path (aes (x = 1 - spec,
y = sens)) +
geom_ribbon (aes (x = 1 - spec,
ymin = 0 ,
ymax = sens),
fill = "#777777" ) +
geom_abline (intercept = 0 , slope = 1 ) +
geom_text (aes (x = 0.15 ,
y = .8 ,
label = "AOC" ),
size = 10 ,
color = "#FFFFFF" ) +
geom_text (aes (x = 0.8 ,
y = .2 ,
label = "AUC" ),
size = 10 ) +
scale_x_continuous (name = "Spec" ,
breaks = seq (0 , 1 , length = 11 ),
labels = formatC (seq (1 , 0 , length = 11 ), format = "g" , digits = 1 ),
expand = expand_scale (add = 0 )) +
scale_y_continuous (name = "Sens" ,
breaks = seq (0 , 1 , length = 11 ),
labels = formatC (seq (0 , 1 , length = 11 ), format = "g" , digits = 1 ),
expand = expand_scale (add = 0 )) +
theme (text = element_text (size = 16 ),
legend.position = "none" );
Higher AUC is better
AUCs for ‘(true)’ and ‘firth’ in training step:
Area under the curve: 0.705
Area under the curve: 0.7375
AUCs for ‘(true)’ and ‘firth’ in validation step:
Area under the curve: 0.7235
Area under the curve: 0.6734
What is AUC under the intercept-only (null) model?
Summary thoughts on model assessment
Potential for optimism increases with inherent variability of model building method
Summary thoughts on model assessment
Be specific when saying the model was ‘validated’, and don’t assume others mean what they think they mean:
Does this refer to validation (model selection) or testing (model assessment)?
How was validation measured?
If cross-validation was involved, was it done properly?
Was the set of models considered sensible?
What population did validation and testing occur in?
Summary thoughts on model assessment
The effective region for improvement (difference between null and true models) is often narrow, especially in binary outcome models
References
Hastie, T., Tibshirani, R. and Friedman, J., 2009. The elements of statistical learning.
Steyerberg, E.W., Vickers, A.J., Cook, N.R., Gerds, T., Gonen, M., Obuchowski, N., Pencina, M.J. and Kattan, M.W., 2010. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology, 21(1), pp.128-138.